Building a new layer

This notebook will guide you through implementing a custom layer in neon, as well as a custom activation function. You will learn:

  • the general interface for defining new layers
  • how to use the Nervana backend functions

Preamble

The first step is to set up our compute backend and load our dataset.


In [ ]:
import neon
print(neon.__version__)

# use a GPU backend
from neon.backends import gen_backend
be = gen_backend('gpu', batch_size=128)

# load data
from neon.data import MNIST

mnist = MNIST(path='../data/')
train_set = mnist.train_iter
test_set = mnist.valid_iter
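
If a CUDA-capable GPU is not available, the 'cpu' backend can be used instead (everything in this notebook still works, just more slowly). Here is a minimal fallback sketch you could use in place of the gen_backend call above; note that the backend should be created before the data iterators are built:


In [ ]:
# fall back to the CPU backend if a GPU backend cannot be created
from neon.backends import gen_backend

try:
    be = gen_backend('gpu', batch_size=128)
except Exception:
    be = gen_backend('cpu', batch_size=128)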

Build your own layer

Instead of importing the neon-supplied Affine layer, we will build our own.

Note: Affine is actually a compound layer; it bundles a linear layer with a bias transform and an activation function. The Linear layer is what implements a fully connected layer.

First, let's build our own linear layer, called MyLinear, and then wrap it in a compound layer, MyAffine.
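
For reference, here is a minimal sketch of how the built-in compound layer is typically constructed (the Gaussian initializer and Rectlin activation are placeholder choices, not part of this tutorial's model):


In [ ]:
# typical use of the built-in compound layer, for comparison with MyAffine below
from neon.layers import Affine
from neon.initializers import Gaussian
from neon.transforms import Rectlin

layer = Affine(nout=100, init=Gaussian(loc=0.0, scale=0.01), activation=Rectlin())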

There are several important components to a layer in neon:

  • configure: during model initialization, this layer receives the previous layer's object and uses it to set this layer's in_shape and out_shape attributes.
  • allocate: after each layer's shape is configured, this layer's shape information will be used to allocate memory for the output activations from fprop.
  • fprop: forward propagation. Should return a tensor with shape equal to the layer's out_shape attribute.
  • bprop: backward propagation. Should compute the gradients with respect to the layer's inputs (the deltas passed back to the previous layer) and with respect to its weights.

In the implementation below, fprop is implemented using element-wise backend operations, which will be very slow. Try replacing it with the backend's compound_dot function, as is done in the bprop method; a sketch of this replacement follows the MyLinear code cell.


In [ ]:
from neon.layers.layer import ParameterLayer, interpret_in_shape

# Subclass from ParameterLayer, which handles the allocation
# of memory buffers for the output activations, weights, and 
# bprop deltas.
class MyLinear(ParameterLayer):

    def __init__(self, nout, init, name=None):
        # the third argument, "Disabled", sets the layer's parallelism mode
        super(MyLinear, self).__init__(init, name, "Disabled")
        self.nout = nout
        
        # required attributes
        self.inputs = None  # cache of the inputs, saved during fprop for use in bprop
        self.in_shape = None  # shape of the inputs to this layer
        self.out_shape = None  # shape of the outputs from this layer

    def __str__(self):
        return "Linear Layer '%s': %d inputs, %d outputs" % (
               self.name, self.nin, self.nout)

    def configure(self, in_obj):
        """
        Configure the layer's input shape and output shape attributes. This is
        required for allocating the output buffers.
        """
        super(MyLinear, self).configure(in_obj)
        
        # the shape of the input is (# input features, batch_size)
        (self.nin, self.nsteps) = interpret_in_shape(self.in_shape)
        
        # shape of the output is (# output units, batch_size)
        self.out_shape = (self.nout, self.nsteps)
        
        # if the shape of the weights has not been set yet,
        # this layer's W is a tensor of shape (# outputs, # inputs).
        if self.weight_shape is None:
            self.weight_shape = (self.nout, self.nin)
      
        return self
    
    # We use the superclass' allocate() method.
    # For a general layer that needs additional memory allocations
    # for its computations, you can override allocate() to set up
    # your own buffers.
    #
    # def allocate(self)

    # fprop function
    # * the inference flag can be used to skip storing activations that are only needed for training
    # * beta scales any existing contents of the output buffer so results can be
    #   accumulated (as in compound_dot: C = A*B + beta*C); the slow loop below ignores it
    def fprop(self, inputs, inference=False, beta=0.0):
        self.inputs = inputs

        # here we compute y = W*X inefficiently using the backend functions
        # try substituting this with the backend `compound_dot` function to see
        # the speed-up from using a custom kernel!
        for r in range(self.outputs.shape[0]):
            for c in range(self.outputs.shape[1]):
                self.outputs[r,c] = self.be.sum(self.be.multiply(self.W[r], self.inputs[:,c].T))
    
        # self.be.compound_dot(A=self.W, B=self.inputs, C=self.outputs, beta=beta)
        return self.outputs

    def bprop(self, error, alpha=1.0, beta=0.0):
        
        # to save you a headache, we use the backend compound_dot function here to compute
        # the back-propagated deltas = W^T * error, and the weight gradients dW = error * inputs^T.
        if self.deltas:
            self.be.compound_dot(A=self.W.T, B=error, C=self.deltas, alpha=alpha, beta=beta)
        self.be.compound_dot(A=error, B=self.inputs.T, C=self.dW)
        return self.deltas
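
As suggested above, the slow element-wise loop in fprop can be replaced with the backend's compound_dot kernel. Here is a minimal sketch; you can either edit the class definition above and re-run the cell, or patch the method in as shown here.


In [ ]:
# sketch of a faster fprop using the backend's fused matrix-multiply kernel
def fast_fprop(self, inputs, inference=False, beta=0.0):
    self.inputs = inputs
    # outputs = W . inputs + beta * outputs, computed in a single kernel call
    self.be.compound_dot(A=self.W, B=self.inputs, C=self.outputs, beta=beta)
    return self.outputs

# replace the slow implementation on the class
MyLinear.fprop = fast_fprop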

Wrap the above layer in a compound layer container, which can also bundle in a bias, batch normalization, and an activation function.


In [ ]:
from neon.layers.layer import CompoundLayer
class MyAffine(CompoundLayer):

    def __init__(self, nout, init, bias=None,
                 batch_norm=False, activation=None, name=None):
        super(MyAffine, self).__init__(bias=bias, batch_norm=batch_norm,
                                       activation=activation, name=name)
        self.append(MyLinear(nout, init, name=name))
        self.add_postfilter_layers()
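
If you are curious how the compound layer expands: CompoundLayer in neon subclasses list, so constructing MyAffine yields the sequence of sub-layers directly. A quick sketch (the initializer and activation below are just illustrative choices):


In [ ]:
# peek at the sub-layers that MyAffine expands into
from neon.initializers import Gaussian
from neon.transforms import Rectlin

affine = MyAffine(nout=100, init=Gaussian(loc=0.0, scale=0.01), activation=Rectlin())
print([type(l).__name__ for l in affine])  # expect the linear layer plus an Activation layer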

Defining an activation function (transform)

We can become more familiar with the backend functions by implementing our own softmax activation function.


In [ ]:
from neon.transforms.transform import Transform

class MySoftmax(Transform):
    """
    SoftMax activation function. Ensures that the activation output sums to 1.
    """
    def __init__(self, name=None, epsilon=2**-23):
        """
        Class constructor.
        Arguments:
            name (string, optional): Name (default: none)
            epsilon (float, optional): Not used.
        """
        super(MySoftmax, self).__init__(name)
        self.epsilon = epsilon

    def __call__(self, x):
        """
        Implement the softmax function. The input has shape (# features, batch_size) and
        the desired output is (# features, batch_size), but where the features sum to 1.
        We use the formula:
        
        f(x) = e^(x-max(x)) / sum(e^(x-max(x))) 
        """
        return (self.be.reciprocal(self.be.sum(
                self.be.exp(x - self.be.max(x, axis=0)), axis=0)) *
                self.be.exp(x - self.be.max(x, axis=0)))

    def bprop(self, x):
        """
        We take a shortcut here: the softmax derivative cancels with the CrossEntropy
        derivative when the two are paired, so the combined gradient reduces to
        (outputs - targets) and this transform can simply return 1.
        """
        return 1
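
Before wiring MySoftmax into a model, it can be sanity-checked against a NumPy reference implementation. Here is a small sketch (the shapes and tolerance are arbitrary choices):


In [ ]:
# compare MySoftmax to a NumPy reference on random data
import numpy as np

np_x = np.random.randn(10, 128).astype(np.float32)  # (# features, batch_size)
x_dev = be.array(np_x)

# backend ops build an op-tree; assigning into a tensor evaluates it on the device
out_dev = be.empty((10, 128))
out_dev[:] = MySoftmax()(x_dev)

e = np.exp(np_x - np_x.max(axis=0, keepdims=True))
np_out = e / e.sum(axis=0, keepdims=True)

print(np.allclose(out_dev.get(), np_out, atol=1e-5))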

Putting together all of the pieces

The architecture here is the same as in the mnist_mlp.py example, except that here we use our own MyAffine layer and MySoftmax activation function.


In [ ]:
from neon.initializers import Gaussian
from neon.models import Model
from neon.transforms.activation import Rectlin

init_norm = Gaussian(loc=0.0, scale=0.01)

# assemble all of the pieces
layers = []
layers.append(MyAffine(nout=100, init=init_norm, activation=Rectlin()))
layers.append(MyAffine(nout=10, init=init_norm, activation=MySoftmax()))

# initialize model object
mlp = Model(layers=layers)

Fit

Train the model using a cross-entropy loss and a gradient descent with momentum optimizer. This will be slow because our fprop is inefficient; replace the fprop function with the backend's compound_dot method to speed it up!


In [ ]:
from neon.layers import GeneralizedCost
from neon.transforms import CrossEntropyMulti
from neon.optimizers import GradientDescentMomentum
from neon.callbacks.callbacks import Callbacks

cost = GeneralizedCost(costfunc=CrossEntropyMulti())
optimizer = GradientDescentMomentum(0.1, momentum_coef=0.9)
callbacks = Callbacks(mlp, eval_set=test_set)

mlp.fit(train_set, optimizer=optimizer, num_epochs=10, cost=cost,
        callbacks=callbacks)
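
Once training finishes, you may want to check the misclassification error on the held-out set, as in the mnist_mlp.py example. A minimal sketch:


In [ ]:
# evaluate misclassification error on the validation set
from neon.transforms import Misclassification

error_pct = 100 * mlp.eval(test_set, metric=Misclassification())
print('Misclassification error = %.1f%%' % error_pct)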